IEICE global.ieice.org Site

Author Search Result

[Author] Hideharu AMANO(66hit)

41-60hit(66hit)

The Implementation of a Hybrid Router and Dynamic Switching Algorithm on a Multi-FPGA System
Tomoki SHIMIZU Kohei ITO Kensuke IIZUKA Kazuei HIRONAKA Hideharu AMANO

PAPER

Pubricized:
2022/06/30
Vol:
E105-D No:12
Page(s):
2008-2018
The multi-FPGA system known as, the Flow-in-Cloud (FiC) system, is composed of mid-range FPGAs that are directly interconnected by high-speed serial links. FiC is currently being developed as a server for multi-access edge computing (MEC), which is one of the core technologies of 5G. Because the applications of MEC are sometimes timing-critical, a static time division multiplexing (STDM) network has been used on FiC. However, the STDM network exhibits the disadvantage of decreasing link utilization, especially under light traffic. To solve this problem, we propose a hybrid router that combines packet switching for low-priority communication and STDM for high-priority communication. In our hybrid network, the packet switching uses slots that are unused by the STDM; therefore, best-effort communication by packet switching and QoS guarantee communication by the STDM can be used simultaneously. Furthermore, to improve each link utilization under a low network traffic load, we propose a dynamic communication switching algorithm. In our algorithm, each router monitors the network load metrics, and according to the metrics, timing-critical tasks select the STDM according to the metrics only when congestion occurs. This can achieve both QoS guarantee and efficient utilization of each link with a small resource overhead. In our evaluation, the dynamic algorithm was up to 24.6% faster on the execution time with a high network load compared to the packet switching on a real multi-FPGA system with 24 boards.
A Mapping Method for Multi-Process Execution on Dynamically Reconfigurable Processors
Vu MANH TUAN Hideharu AMANO

PAPER-Computer Systems

Vol:
E91-D No:9
Page(s):
2312-2322
The multi-process execution in dynamically reconfigurable processors is a technique to enhance throughput by trying to exploit more inherent parallelism of applications. Basically, a total process for an application is divided into small processes, assigned into limited areas of a reconfigurable array, and concurrently executed in a pipelined manner. In order to improve the efficiency of the multi-process execution, a systematic method for mapping processes onto a reconfigurable array consisting of multiple hardware execution units is essential. This paper proposes and investigates a systematic method for mapping an application modeled as a Kahn Process Network onto a dynamically reconfigurable processing array. In order to execute streaming applications in a pipelined manner, the size of Tiles, which is a unit area of dynamically reconfigurable array, and the grouping of processes are adjusted. Using real applications such as DCT, JPEG encoder and Turbo encoder, the impact of different versions mapped onto the NEC Dynamically Reconfigurable Processor on performance is evaluated. Evaluation results show that our proposed mapping algorithm achieves the best performance in terms of the throughput and the execution time.
Partial Reconfiguration of Flux Limiter Functions in MUSCL Scheme Using FPGA
Mohamad Sofian ABU TALIP Takayuki AKAMINE Yasunori OSANA Naoyuki FUJITA Hideharu AMANO

PAPER-Computer System

Vol:
E95-D No:10
Page(s):
2369-2376
Computational Fluid Dynamics (CFD) is used as a common design tool in the aerospace industry. UPACS, a package for CFD, is convenient for users, since a customized simulator can be built just by selecting desired functions. The problem is its computation speed, which is difficult to enhance by using the clusters due to its complex memory access patterns. As an economical solution, accelerators using FPGAs are hopeful candidate. However, the total scale of UPACS is too large to be implemented on small numbers of FPGAs. For cost efficient implementation, partial reconfiguration which dynamically loads only required functions is proposed in this paper. Here, the MUSCL scheme, which is used frequently in UPACS, is selected as a target. Partial reconfiguration is applied to the flux limiter functions (FLF) in MUSCL. Four FLFs are implemented for Turbulence MUSCL (TMUSCL) and eight FLFs are for Convection MUSCL (CMUSCL). All FLFs are developed independently and separated from the top MUSCL module. At start-up, only required FLFs are selected and deployed in the system without interfering the other modules. This implementation has successfully reduced the resource utilization by 44% to 63%. Total power consumption also reduced by 33%. Configuration speed is improved by 34-times faster as compared to full reconfiguration method. All implemented functions achieved at least 17 times speed-up performance compared with the software implementation.
Message Transfer Algorithms on the Recursive Diagonal Torus
Yulu YANG Hideharu AMANO

PAPER-Computer Systems

Vol:
E79-D No:2
Page(s):
107-116
Recursive Diagonal Torus (RDT) is a class of interconnection network for massively parallel computers with 216 nodes. In this paper, message transfer algorithms on the RDT are proposed and discussed. First, a simple one-to-one message routing algorithm called the vector routing is introduced and its practical extension called the floating vector routing is proposed. In the floating vector routing both the diameter and average distance are improved compared with the fixed vector routing. Next, broadcasting and hypercube emulation algorithm scheme on the RDT are shown. Finally, deadlock-free message routing algorithms on the RDT are discussed. By a simple modification of the e-cube routing and a small numbers of additional virtual channels, both one-to-one message transfer and broadcast can be achieved without deadlock.
High-Speed Fully-Adaptable CRC Accelerators
Amila AKAGIC Hideharu AMANO

PAPER-Computer System

Vol:
E96-D No:6
Page(s):
1299-1308
Cyclic Redundancy Check (CRC) is a well known error detection scheme used to detect corruption of digital content in digital networks and storage devices. Since it is a compute-intensive process which adversely affects performance, hardware acceleration using FPGAs has been tried and satisfactory performance has been achieved. However, recent extended usage of networks and storage systems require various correction capabilities for various CRC standards. Traditional hardware designs based on the LFSR (Linear Feedback Shift Register) tend to have fixed structure without such flexibility. Here, fully-adaptable CRC accelerator based on a table-based algorithm is proposed. The table-based algorithm is a flexible method commonly used in software implementations. It has been rarely implemented with the hardware, since it is believed that the operational speed is not enough. However, by using pipelined structure and efficient use of memory modules in FPGAs, it appeared that the table-based fixed CRC accelerators achieved better performance than traditional implementation. Based on the implementation, fully-adaptable CRC accelerator which eliminate the need for many non-adaptable CRC implementations is proposed. The accelerator has ability to process arbitrary number of input data and generates CRC for any known CRC standard, up to 65 bits of generator polynomial, during run-time. Further, we modify Table generation algorithm in order to decrease its space complexity from O(nm) to O(n). On Xilinx Virtex 6 LX550T board, the fully-adaptable accelerators occupy between 1 to 2% area to produce maximum of 289.8 Gbps at 283.1 MHz if BRAM is deployed, or between 1.6 - 14% of area for 418 Gbps at 408.9 MHz if tables are implemented in logic. Proposed architecture enables further expansion of throughput by increasing a number of input bits M processed at a time.
A Port Combination Methodology for Application-Specific Networks-on-Chip on FPGAs
Daihan WANG Hiroki MATSUTANI Michihiro KOIBUCHI Hideharu AMANO

PAPER-Reconfigurable System and Applications

Vol:
E90-D No:12
Page(s):
1914-1922
A temporal correlation based port combination algorithm that customizes the router design in Network-on-Chip (NoC) is proposed for reconfigurable systems in order to minimize required hardware amount. Given the traffic characteristics of the target application and the expected hardware amount reduction rate, the algorithm automatically makes the port combination plan for the networks. Since the port combination technique has the advantage of almost keeping the topology including two-surface layout, it does not affect the design of the other layer, such as task mapping and scheduling. The algorithm shows much better efficiency than the algorithm without temporal correlation. For the multimedia stream processing application, the algorithm can save 55% of the hardware amount without performance degradation, while the none temporal correlation algorithm suffers from 30% performance loss.
Recovering Faulty Non-Volatile Flip Flops for Coarse-Grained Reconfigurable Architectures
Takeharu IKEZOE Takuya KOJIMA Hideharu AMANO

PAPER

Pubricized:
2020/12/14
Vol:
E104-C No:6
Page(s):
215-225
Recent IoT devices require extremely low standby power consumption, while a certain performance is needed during the active time, and Coarse-Grained Reconfigurable Arrays (CGRAs) have received attention because of their high energy efficiency. For further reduction of the standby energy consumption of CGRAs, the leakage power for their configuration memory must be reduced. Although the power gating is a common technique, the lost data in flip-flops and memory must be retrieved after the wake-up. Recovering everything requires numerous state transitions and considerable overhead both on its execution time and energy. To address the problem, Non-volatile Cool Mega Array (NVCMA), a CGRA providing non-volatile flip-flops (NVFFs) with spin transfer torque type non-volatile memory (NVM) technology has been developed. However, in general, non-volatile memory technologies have problems with reliability. Some NVFFs are stacked-at-0/1, and cannot store the data in a certain possibility. To improve the chip yield, we propose a mapping algorithm to avoid faulty processing elements of the CGRA caused by the erroneous configuration data. Next, we also propose a method to add an error-correcting code (ECC) mechanism to NVFFs for the configuration and constant memory. The proposed method was applied to NVCMA to evaluate the availability rate and reduction of write time. By using both methods, the average availability ratio of 94.2% was achieved, while the average availability ratio of the nine applications was 0.056% when the probability of failure of the FF was 0.01. The energy for storing data becomes about 2.3 times because of the hardware overhead of ECC but the proposed method can save 8.6% of the writing power on average.
A Fine-Grained Power Gating Control on Linux Monitoring Power Consumption of Processor Functional Units
Atsushi KOSHIBA Motoki WADA Ryuichi SAKAMOTO Mikiko SATO Tsubasa KOSAKA Kimiyoshi USAMI Hideharu AMANO Masaaki KONDO Hiroshi NAKAMURA Mitaro NAMIKI

PAPER

Vol:
E98-C No:7
Page(s):
559-568
The authors have been researching on reducing the power consumption of microprocessors, and developed a low-power processor called “Geyser” by applying power gating (PG) function to the individual functional units of the processor. PG function on Geyser reduces the power consumption of functional units by shutting off the power voltage of idle units. However, the energy overhead of switching the supply voltage for units on and off causes power increases. The amount of the energy overhead varies with the behavior of each functional unit which is influenced by running application, and also with the core temperature. It is therefore necessary to switch the PG function itself on or off according to the state of the processor at runtime to reduce power consumption more effectively. In this paper, the authors propose a PG control method to take the power overhead into account by the operating system (OS). In the proposed method, for achieving much power reduction, the OS calculates the power consumption of each functional unit periodically and inhibits the PG function of the unit whose energy overhead is judged too high. The method was implemented in the Linux process scheduler and evaluated. The results show that the average power consumption of the functional units is reduced by up to 17.2%.
An FPGA-Based Optimizer Design for Distributed Deep Learning with Multiple GPUs
Tomoya ITSUBO Michihiro KOIBUCHI Hideharu AMANO Hiroki MATSUTANI

PAPER

Pubricized:
2021/07/01
Vol:
E104-D No:12
Page(s):
2057-2067
Since deep learning workloads perform a large number of matrix operations on training data, GPUs (Graphics Processing Units) are efficient especially for the training phase. A cluster of computers each of which equips multiple GPUs can significantly accelerate the deep learning workloads. More specifically, a back-propagation algorithm following a gradient descent approach is used for the training. Although the gradient computation is still a major bottleneck of the training, gradient aggregation and optimization impose both communication and computation overheads, which should also be reduced for further shortening the training time. To address this issue, in this paper, multiple GPUs are interconnected with a PCI Express (PCIe) over 10Gbit Ethernet (10GbE) technology. Since these remote GPUs are interconnected with network switches, gradient aggregation and optimizers (e.g., SGD, AdaGrad, Adam, and SMORMS3) are offloaded to FPGA-based 10GbE switches between remote GPUs; thus, the gradient aggregation and parameter optimization are completed in the network. The proposed FPGA-based 10GbE switches with the four optimizers are implemented on NetFPGA-SUME board. Their resource utilizations are increased by PEs for the optimizers, and they consume up to 56% of the resources. Evaluation results using four remote GPUs connected via the proposed FPGA-based switch demonstrate that these optimizers are accelerated by up to 3.0x and 1.25x compared to CPU and GPU implementations, respectively. Also, the gradient aggregation throughput by the FPGA-based switch achieves up to 98.3% of the 10GbE line rate.
A Multi-Tenant Resource Management System for Multi-FPGA Systems
Miho YAMAKURA Ryousei TAKANO Akram BEN AHMED Midori SUGAYA Hideharu AMANO

PAPER

Pubricized:
2021/10/08
Vol:
E104-D No:12
Page(s):
2078-2088
FPGA (Field Programmable Gate Array) based accelerators are attracting significant interest in cloud computing systems. Combining multi-FPGA systems with cloud computing brings a new perspective to the reconfigurable computing research. However, the multi-tenancy of a multi-FPGA system has not been fully discussed in the previous researches. In this paper, we propose a multi-tenant resource management system, named FiC-RM, for a multi-FPGA cloud system. FiC-RM provides users with a set of FPGA resources according to their requirements and allows them to exclusively access FPGA boards and the interconnection network. To achieve this, we propose a placement algorithm which is a key to efficiently share the limited resources. We demonstrate FiC-RM controls a practical scale multi-FPGA system. Moreover, Our simulation study shows that our placement algorithm achieved 3 to 4% improvement in the average resource usage and a 20-second reduction in the response time, compared to other existing naive algorithms.
Boosting the Performance of Interconnection Networks by Selective Data Compression
Naoya NIWA Hideharu AMANO Michihiro KOIBUCHI

PAPER

Pubricized:
2022/07/12
Vol:
E105-D No:12
Page(s):
2057-2065
This study presents a selective data-compression interconnection network to boost its performance. Data compression virtually increases the effective network bandwidth. One drawback of data compression is a long latency to perform (de-)compression operation at a compute node. In terms of the communication latency, we explore the trade-off between the compression latency overhead and the reduced injection latency by shortening the packet length by compression algorithms. As a result, we present to selectively apply a compression technique to a packet. We perform a compression operation to long packets and it is also taken when network congestion is detected at a source compute node. Through a cycle-accurate network simulation, the selective compression method using the above compression algorithms improves by up to 39% the network throughput with a moderate increase in the communication latency of short packets.
A Perpetuum Mobile 32bit CPU on 65nm SOTB CMOS Technology with Reverse-Body-Bias Assisted Sleep Mode
Koichiro ISHIBASHI Nobuyuki SUGII Shiro KAMOHARA Kimiyoshi USAMI Hideharu AMANO Kazutoshi KOBAYASHI Cong-Kha PHAM

PAPER

Vol:
E98-C No:7
Page(s):
536-543
A 32bit CPU, which can operate more than 15 years with 220mAH Li battery, or eternally operate with an energy harvester of in-door light is presented. The CPU was fabricated by using 65nm SOTB CMOS technology (Silicon on Thin Buried oxide) where gate length is 60nm and BOX layer thickness is 10nm. The threshold voltage was designed to be as low as 0.19V so that the CPU operates at over threshold region, even at lower supply voltages down to 0.22V. Large reverse body bias up to -2.5V can be applied to bodies of SOTB devices without increasing gate induced drain leak current to reduce the sleep current of the CPU. It operated at 14MHz and 0.35V with the lowest energy of 13.4 pJ/cycle. The sleep current of 0.14µA at 0.35V with the body bias voltage of -2.5V was obtained. These characteristics are suitable for such new applications as energy harvesting sensor network systems, and long lasting wearable computers.
Remote Dynamic Reconfiguration of a Multi-FPGA System FiC (Flow-in-Cloud)
Kazuei HIRONAKA Kensuke IIZUKA Miho YAMAKURA Akram BEN AHMED Hideharu AMANO

PAPER-Computer System

Pubricized:
2021/05/12
Vol:
E104-D No:8
Page(s):
1321-1331
Multi-FPGA systems have been receiving a lot of attention as a low cost and energy efficient system for Multi-access Edge Computing (MEC). For such purpose, a bare-metal multi-FPGA system called FiC (Flow-in-Cloud) is under development. In this paper, we introduce the FiC multi FPGA cluster which is applied partial reconfiguration (PR) FPGA design flow to support online user defined accelerator replacement while executing FPGA interconnection network and its low-level multiple FPGA management software called remote PR manager. With the remote PR manager, the user can define the FiC FPGA cluster setup by JSON and control the cluster from user application with the cooperation of simple cluster management tool / library called ficmgr on the client host and REST API service provider called ficwww on Raspberry Pi 3 (RPi3) on each node. According to the evaluation results with a prototype FiC FPGA cluster system with 12 nodes, using with online application replacement by PR and on-the-fly FPGA bitstream compression, the time for FPGA bitstream distribution was reduced to 1/17 and the total cluster setup time was reduced by 21∼57% than compared to cluster setup with full configuration FPGA bitstream.
FiC-RNN: A Multi-FPGA Acceleration Framework for Deep Recurrent Neural Networks
Yuxi SUN Hideharu AMANO

PAPER-Computer System

Pubricized:
2020/09/24
Vol:
E103-D No:12
Page(s):
2457-2462
Recurrent neural networks (RNNs) have been proven effective for sequence-based tasks thanks to their capability to process temporal information. In real-world systems, deep RNNs are more widely used to solve complicated tasks such as large-scale speech recognition and machine translation. However, the implementation of deep RNNs on traditional hardware platforms is inefficient due to long-range temporal dependence and irregular computation patterns within RNNs. This inefficiency manifests itself in the proportional increase in the latency of RNN inference with respect to the number of layers of deep RNNs on CPUs and GPUs. Previous work has focused mostly on optimizing and accelerating individual RNN cells. To make deep RNN inference fast and efficient, we propose an accelerator based on a multi-FPGA platform called Flow-in-Cloud (FiC). In this work, we show that the parallelism provided by the multi-FPGA system can be taken advantage of to scale up the inference of deep RNNs, by partitioning a large model onto several FPGAs, so that the latency stays close to constant with respect to increasing number of RNN layers. For single-layer and four-layer RNNs, our implementation achieves 31x and 61x speedup compared with an Intel CPU.
Iterative Synthesis Methods Estimating Programmable-Wire Congestion in a Dynamically Reconfigurable Processor
Takao TOI Takumi OKAMOTO Toru AWASHIMA Kazutoshi WAKABAYASHI Hideharu AMANO

PAPER-High-Level Synthesis and System-Level Design

Vol:
E94-A No:12
Page(s):
2619-2627
Iterative synthesis methods for making aware of wire congestion are proposed for a multi-context dynamically reconfigurable processor (DRP) with a large number of processing elements (PEs) and programmable-wire connections. Although complex data-paths can be synthesized using the programmable-wire, its delay is long especially when wire connections are congested. We propose two iterative synthesis techniques between a high-level synthesizer (HLS) and the place & route tool to shorten the prolonged wire delay. First, we feed back wire delays for each context to a scheduler in the HLS. The experimental results showed that a critical-path delay was shorten by 21% on average for applications with timing closure problems. Second, we skip the routing and estimate wire delays based on the congestion. The synthesis time was shorten to 1/3 causing delay improvement rate degradation at two points on average.
Fine-Grained Run-Tume Power Gating through Co-optimization of Circuit, Architecture, and System Software Design Open Access
Hiroshi NAKAMURA Weihan WANG Yuya OHTA Kimiyoshi USAMI Hideharu AMANO Masaaki KONDO Mitaro NAMIKI

INVITED PAPER

Vol:
E96-C No:4
Page(s):
404-412
Power consumption has recently emerged as a first class design constraint in system LSI designs. Specially, leakage power has occupied a large part of the total power consumption. Therefore, reduction of leakage power is indispensable for efficient design of high-performance system LSIs. Since 2006, we have carried out a research project called “Innovative Power Control for Ultra Low-Power and High-Performance System LSIs”, supported by Japan Science and Technology Agency as a CREST research program. One of the major objectives of this project is reducing the leakage power consumption of system LSIs by innovative power control through tight cooperation and co-optimization of circuit technology, architecture, and system software designs. In this project, we focused on power gating as a circuit technique for reducing leakage power. Temporal granularity is one of the most important issue in power gating. Thus, we have developed a series of Geysers as proof-of-concept CPUs which provide several mechanisms of fine-grained run-time power gating. In this paper, we describe their concept and design, and explain why co-optimization of different design layers are important. Then, three kinds of power gating implementations and their evaluation are presented from the view point of power saving and temporal granularity.
A Novel Channel Assignment Method to Ensure Deadlock-Freedom for Deterministic Routing
Ryuta KAWANO Hiroshi NAKAHARA Seiichi TADE Ikki FUJIWARA Hiroki MATSUTANI Michihiro KOIBUCHI Hideharu AMANO

PAPER-Computer System

Pubricized:
2017/05/19
Vol:
E100-D No:8
Page(s):
1798-1806
Inter-switch networks for HPC systems and data-centers can be improved by applying random shortcut topologies with a reduced number of hops. With minimal routing in such networks; however, deadlock-freedom is not guaranteed. Multiple Virtual Channels (VCs) are efficiently used to avoid this problem. However, previous works do not provide good trade-offs between the number of required VCs and the time and memory complexities of an algorithm. In this work, a novel and fast algorithm, named ACRO, is proposed to endorse the arbitrary routing functions with deadlock-freedom, as well as consuming a small number of VCs. A heuristic approach to reduce VCs is achieved with a hash table, which improves the scalability of the algorithm compared with our previous work. Moreover, experimental results show that ACRO can reduce the average number of VCs by up to 63% when compared with a conventional algorithm that has the same time complexity. Furthermore, ACRO reduces the time complexity by a factor of O(|N|⋅log|N|), when compared with another conventional algorithm that requires almost the same number of VCs.
A Layout-Oriented Routing Method for Low-Latency HPC Networks
Ryuta KAWANO Hiroshi NAKAHARA Ikki FUJIWARA Hiroki MATSUTANI Michihiro KOIBUCHI Hideharu AMANO

PAPER-Interconnection networks

Pubricized:
2017/07/14
Vol:
E100-D No:12
Page(s):
2796-2807
End-to-end network latency has become an important issue for parallel application on large-scale high performance computing (HPC) systems. It has been reported that randomly-connected inter-switch networks can lower the end-to-end network latency. This latency reduction is established in exchange for a large amount of routing information. That is, minimal routing on irregular networks is achieved by using routing tables for all destinations in the networks. In this work, a novel distributed routing method called LOREN (Layout-Oriented Routing with Entries for Neighbors) to achieve low-latency with a small routing table is proposed for irregular networks whose link length is limited. The routing tables contain both physically and topologically nearby neighbor nodes to ensure livelock-freedom and a small number of hops between nodes. Experimental results show that LOREN reduces the average latencies by 5.8% and improves the network throughput by up to 62% compared with a conventional compact routing method. Moreover, the number of required routing table entries is reduced by up to 91%, which improves scalability and flexibility for implementation.
Body Bias Domain Partitioning Size Exploration for a Coarse Grained Reconfigurable Accelerator
Yusuke MATSUSHITA Hayate OKUHARA Koichiro MASUYAMA Yu FUJITA Ryuta KAWANO Hideharu AMANO

PAPER-Architecture

Pubricized:
2017/07/14
Vol:
E100-D No:12
Page(s):
2828-2836
Body biasing can be used to control the leakage power and performance by changing the threshold voltage of transistors after fabrication. Especially, a new process called Silicon-On-Thin Box (SOTB) CMOS can control their balance widely. When it is applied to a Coarse Grained Reconfigurable Array (CGRA), the leakage power can be much reduced by precise bias control with small domain size including a small number of PEs. On the other hand, the area overhead for separating power domain and delivering a lot of wires for body bias voltage supply increases. This paper explores the grain of domain size of an energy efficient CGRA called CMA (Cool Mega Array). By using Genetic Algorithm based body bias assignment method, the leakage reduction of various grain size was evaluated. As a result, a domain with 2x1 PEs achieved about 40% power reduction with a 6% area overhead. It has appeared that a combination of three body bias voltages; zero bias, weak reverse bias and strong reverse bias can achieve the optimal leakage reduction and area overhead balance in most cases.
Analysis of Body Bias Control Using Overhead Conditions for Real Time Systems: A Practical Approach
Carlos Cesar CORTES TORRES Hayate OKUHARA Nobuyuki YAMASAKI Hideharu AMANO

PAPER-Computer System

Pubricized:
2018/01/12
Vol:
E101-D No:4
Page(s):
1116-1125
In the past decade, real-time systems (RTSs), which must maintain time constraints to avoid catastrophic consequences, have been widely introduced into various embedded systems and Internet of Things (IoTs). The RTSs are required to be energy efficient as they are used in embedded devices in which battery life is important. In this study, we investigated the RTS energy efficiency by analyzing the ability of body bias (BB) in providing a satisfying tradeoff between performance and energy. We propose a practical and realistic model that includes the BB energy and timing overhead in addition to idle region analysis. This study was conducted using accurate parameters extracted from a real chip using silicon on thin box (SOTB) technology. By using the BB control based on the proposed model, about 34% energy reduction was achieved.

41-60hit(66hit)

Author Search Result

[Author] Hideharu AMANO(66hit)

The Implementation of a Hybrid Router and Dynamic Switching Algorithm on a Multi-FPGA System

A Mapping Method for Multi-Process Execution on Dynamically Reconfigurable Processors

Partial Reconfiguration of Flux Limiter Functions in MUSCL Scheme Using FPGA

Message Transfer Algorithms on the Recursive Diagonal Torus

High-Speed Fully-Adaptable CRC Accelerators

A Port Combination Methodology for Application-Specific Networks-on-Chip on FPGAs

Recovering Faulty Non-Volatile Flip Flops for Coarse-Grained Reconfigurable Architectures

A Fine-Grained Power Gating Control on Linux Monitoring Power Consumption of Processor Functional Units

An FPGA-Based Optimizer Design for Distributed Deep Learning with Multiple GPUs

A Multi-Tenant Resource Management System for Multi-FPGA Systems

Boosting the Performance of Interconnection Networks by Selective Data Compression

A Perpetuum Mobile 32bit CPU on 65nm SOTB CMOS Technology with Reverse-Body-Bias Assisted Sleep Mode

Remote Dynamic Reconfiguration of a Multi-FPGA System FiC (Flow-in-Cloud)

FiC-RNN: A Multi-FPGA Acceleration Framework for Deep Recurrent Neural Networks

Iterative Synthesis Methods Estimating Programmable-Wire Congestion in a Dynamically Reconfigurable Processor

Fine-Grained Run-Tume Power Gating through Co-optimization of Circuit, Architecture, and System Software Design Open Access

A Novel Channel Assignment Method to Ensure Deadlock-Freedom for Deterministic Routing

A Layout-Oriented Routing Method for Low-Latency HPC Networks

Body Bias Domain Partitioning Size Exploration for a Coarse Grained Reconfigurable Accelerator

Analysis of Body Bias Control Using Overhead Conditions for Real Time Systems: A Practical Approach

Latest Issue

Links

Call for Papers

Submit to IEICE Trans.

Transactions NEWS

Popular articles